Abstract This paper introduces a novel framework for modeling interacting humans
in a multi-stage game. This "iterated semi network-form game" framework
has the following desirable characteristics: (1) Bounded rational players, (2) strate- 
gic players (i.e., players account for one another's reward functions when predict- 
ing one another's behavior), and (3) computational tractability even on real-world 
systems. We achieve these benefits by combining concepts from game theory and 
reinforcement learning. To be precise, we extend the bounded rational "level-K rea- 
soning" model to apply to games over multiple stages. Our extension allows the 
decomposition of the overall modeling problem into a series of smaller ones, each 
of which can be solved by standard reinforcement learning algorithms. We call this 
hybrid approach "level-K reinforcement learning". We investigate these ideas in a 
cyber battle scenario over a smart power grid and discuss the relationship between
the behavior predicted by our model and what one might expect of real human
defenders and attackers.



1 Introduction 

We are interested in modeling something that has never been modeled before: the 
interaction of human players in a very complicated time-extended domain, such as a 
cyber attack on a power grid, when the players have little or no previous experience 
with that domain. Our approach combines concepts from game theory and computer 
science in a novel way. In particular, we introduce the first time-extended level-K 
game theory model [9, 31, 37]. We solve this model using reinforcement learning
(RL) algorithms [38] to learn each player's policy against the level K - 1 policies
of the other players. The result is a non-equilibrium model of a complex and time- 
extended scenario where bounded-rational players interact strategically. Our model 
is computationally tractable even in real-world domains. 



1.1 Overview and Related Work 

The foundation of our approach is the use of a "semi-Bayes net" to capture the functional
structure of a strategic game. A semi-Bayes net is essentially a Bayes net [21]
where the conditional probability distributions for nodes representing player deci- 
sions are left unspecified. Combining a semi-Bayes net with utility functions for the 
players yields a "semi network-form game" (or semi net-form game) [24], which
takes the place of the extensive-form game [30] used in conventional game theory.^
In this paper, we extend the semi net-form game framework to a repeated-time
structure by defining an "iterated semi net-form game". The conditional probability
distributions at the player decision nodes are specified by combining the iterated
semi net-form game with a solution concept, e.g., the level-K RL policies used in
this paper. The result is a Bayes net model of strategic behavior.

^ The "semi-" modifier refers to a restricted category of models within a broader class of models
called network-form games. A key difference between the semi network-form game used here
and the general formulation of network-form games is that the general formulation can handle
unawareness - situations where a player does not know of the possibility of some aspect of the
game [42]. Unawareness is a major stumbling block of conventional game theoretic approaches
in part because it forces a disequilibrium by presenting an extreme violation of the common prior
assumption [16].

Like all Bayes nets, our model describes the conditional dependence relationships
among a set of random variables. In the context of a strategic scenario, conditional
dependencies can be interpreted to describe, for example, the information
available to a player while making a strategic decision. In this way, semi net-form
games incorporate a notion similar to that of "information sets" found in extensive-form
games. However, information in semi net-form games takes on the nature of
information in statistics, thereby opening it to formal analysis by any number of
statistical tools [22, 33], as opposed to information sets, which use an informal notion.
Just as information sets are the key to capturing incomplete information in
extensive-form games, conditional dependence relationships are the key to capturing
incomplete information in semi net-form games.^ In our example of a cyber
battle, the cyber defender (power grid operator) has access to the full system state, 
whereas the cyber attacker only has access to the part of the system that has been 
compromised. Representing this in the semi net-form game diagram means the de- 
fender's decision node has the full system state as its parent, while the attacker's 
decision node only has a subset of the state as its parent. As a consequence, the 
attacker cannot distinguish between some of the system states. In the language of 
extensive-form games, we say that all states mapping to the same attacker's obser- 
vation belong to the same information set. 

It is important to recognize that the semi net-form game model is independent 
of a solution concept. Just as a researcher can apply a variety of equilibrium concepts
(Nash equilibrium, subgame perfect equilibrium, quantal response equilibrium [27, 28], etc.)
to the same extensive-form game, so too can various solution
concepts apply to the same semi net-form game. In this paper we focus on the use
of level-K RL policies; however, the semi net-form game
model does not depend on that concept in any way. One could, in principle, apply Nash equilibrium,
subgame perfect equilibrium, quantal response equilibrium, etc. to a semi net-form
game, though doing so may not result in a computationally tractable model or
a good description of human behavior. 

In the remainder of this introduction, we describe three characteristics whose 
unique combination is the contribution of our paper. The first is that players in our
model are strategic: their policy choices depend on the reward functions of
the other players. This is in contrast to learning-in-games and co-evolution models
[14, 20], wherein players do not use information about their opponents' reward
functions to predict their opponents' decisions and choose their own actions. On this
point, we follow experimental studies [5], which routinely demonstrate the
responsiveness of player behavior to changes in the rewards of other players. 

^ Harsanyi's Bayesian games [17] are a special case of extensive-form games in which nature
first chooses the game, and this move by nature generally belongs to different information sets
for the different players. This structure converts the game of incomplete information to a game
of imperfect information, i.e., the players have imperfectly observed nature's move. In addition to
the fact that Harsanyi used extensive-form games in his work while we use semi network-form
games, our work also differs in what we are modeling. Harsanyi focused on incomplete
information, while our model incorporates incomplete information and any other uncertainty or
stochasticity in the strategic setting.

Second, our approach is computationally feasible even on real-world problems.
This is in contrast to equilibrium models such as subgame perfect equilibrium
and quantal response equilibrium. We avoid the computational problems associated
with solving for equilibria by using the level-K RL policy model, which is a
non-equilibrium solution concept. That is, since level-K players are not forced to
have correct beliefs about the actions of the other players, the level-K strategy of
player i does not depend on the actual strategy of i's opponents. As a result,
the level-K RL policies of each of the players can be solved independently.
The computational tractability of our model is also in contrast to partially
observable Markov decision process- (POMDP-) based models (e.g., Interactive
POMDPs [15]), in which players are required to maintain belief states over belief
states, thus causing a quick explosion of the computational space. We circumvent
this explosion of belief states by formulating policies as mappings from a player's
memory to actions, where memory refers to some subset of a player's current and
past observations, past actions, and statistics derived from those variables. This formulation
puts our work more squarely in the literature of standard RL [18, 38]. As a
final point of computational tractability, our approach uses the policy representation 
instead of the strategic representation of player decisions. The difference is that the 
policy representation forces player behavior to be stationary - the time index is not 
an argument of the policy - whereas in the strategic representation strategies are 
non-stationary in general. 

Third, since our goal is to predict the behavior of real human players, we 
rely heavily on the experimental game theory literature to motivate our modeling 
choices. Using the policy mapping from memories to actions, it is straightforward 
to introduce experimentally motivated behavioral features such as noisy, sampled 
or bounded memory. The result of the RL, then, is an optimal strategy given more
realistic assumptions about the limitations of human beings.^ This is in contrast to
the literature on coevolutionary RL [13, 29], where the goal is to find optimal strategies.
For example, prior work in that literature uses RL to design expert checkers strategies. In
those models, behavioral features motivated by human experimental data are not
included due to the constraining effect they have on optimal strategies. Hence, RL
in our model is used as a description of how real humans behave. This use for RL
has a foundation in neurological research [12, 25], where it has provided a useful
framework for studying and predicting conditioning, habits, goal-directed actions,
incentive salience, motivation and vigor [26]. The level-K model is itself another
way in which we incorporate experimentally motivated themes. In particular, by
using the level-K model instead of an equilibrium solution concept, we avoid the
awkward assumption that players' predictions about each other are always correct.

We investigate all of this for modeling a cyber battle over a smart power grid. We 
discuss the relationship between the behavior predicted by our model and what one 
might expect of real human defenders and attackers. 



^ One can imagine an extension where the RL training is modified to reflect bounded rationality,
satisficing [35], etc. For example, to capture satisficing, the RL may be stopped upon achieving the
satisficing level of utility. Note that we do not pursue such bounded rational RL here.

This chapter is organized as follows. In Section 2, we provide a review of semi
network-form games and the level-K d-relaxed strategies solution concept [24].
This review is the starting point for the theoretical advances of this paper found 
in Section 3. In Section 3 we extend the semi net-form games formalism to iterated 
semi network-form games, which enables interactions over a time-repeated struc- 
ture. This is also where we introduce the level-K RL solution concept. Section 3 is 
the major theoretical contribution of this paper. In Section 4, we apply the iterated
semi net-form game framework to model a cyber battle on a smart power distribu- 
tion network. The goal of Section 4 is to illustrate how an iterated semi net-form 
game is realized and how the level-K RL policy solution concept is implemented. 
In this section we describe the setting of the scenario and lay out the iterated semi 
net-form game model, including observations, memories, moves and utility func- 
tions for both players. We also describe the details of the level-K RL algorithm we 
use to solve for players' policies. This section concludes with simulation results and 
a possible explanation for the resulting behaviors. Section 5 provides a concluding 
discussion of the iterated semi net-form games framework and future work. 



2 Semi Network-Form Games Review 

In this section, we provide a brief review of semi net-form games. For a rigorous 
treatment, please refer to Lee and Wolpert [24].



2.1 Framework Description

A "semi network-form game" (or semi net-form game) uses a Bayes net [21] to
serve as the underlying probabilistic framework, consequently representing all parts 
of the system using random variables. Non-human components such as automation 
and physical systems are described using "chance" nodes, while human components 
are described using "decision" nodes. Formally, chance nodes differ from decision 
nodes in that their conditional probability distributions are prespecified. In contrast, 
each decision node is associated with a utility function which maps an instantiation 
of the net to a real number quantifying the player's utility. In addition to know- 
ing the conditional distributions at the chance nodes, we must also determine the 
conditional distributions at the decision nodes to fully specify the Bayes net. We 
will discuss how to arrive at the players' conditional distributions over possible actions,
also called their "strategies", later in Section 2.2. The discussion is in terms of
countable spaces, but much of the discussion carries over to the uncountable case. 
We describe a semi net-form game as follows:
An ($N$-player) semi network-form game is described by a quintuple $(G, X, u, R, \pi)$
where

1. $G$ is a finite directed acyclic graph represented by a set of vertices and a set of
edges. The graph $G$ defines the topology of the Bayes network, thus specifying
the random variables as well as the relationships between them.

2. $X$ is a Cartesian product of the variable spaces of all vertices. Thus $X$ contains all
instantiations of the Bayes network.

3. $u$ is a function that takes an instantiation of the Bayes network as input and outputs
a vector in $\mathbb{R}^N$, where component $i$ of the output vector represents player
$i$'s utility of the input instantiation. We will typically view it as a set of utility
functions where each one maps an instantiation of the network to a real number.

4. $R$ is a partition of the vertices into $N + 1$ subsets. Each of the first $N$ partitions contains
exactly one vertex, and these are used to associate assignments of decision nodes to
players. In other words, each player controls a single decision node. The $(N + 1)$th
partition contains the remainder of the vertices, which are the chance nodes.

5. $\pi$ is a function that assigns to every chance node a conditional probability distribution
[21] of that node conditioned on the values of its parents.

Specifically, $X_v$ is the set of all possible states at node $v$, $u_i$ is the utility function
of player $i$, $R(i)$ is the decision node set by player $i$, and $\pi$ is the fixed set of distributions
at chance nodes. The semi net-form game is a general framework that has broad
modeling capabilities. As an example, a normal-form game [30] is a semi net-form
game that has no edges. As another example, let $v$ be a decision node of player $i$ that
has one parent, $v'$. Then the conditional distribution $P(X_v \mid X_{v'})$ is a generalization
of an information set.
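
As an illustration only, the quintuple above can be written down as a small data structure. The sketch below is not taken from [24]; the class name `SemiNetFormGame`, its field names, and the `validate` helper are all hypothetical, and acyclicity of the graph is assumed rather than checked. The example instance is the "no edges" normal-form special case mentioned above, using matching pennies as an assumed payoff structure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Mapping, Sequence

Instantiation = Mapping[str, object]  # one value per vertex of the net


@dataclass
class SemiNetFormGame:
    """Hypothetical container mirroring the quintuple (G, X, u, R, pi)."""
    parents: Dict[str, List[str]]                       # G: vertex -> its parent vertices
    spaces: Dict[str, Sequence[object]]                 # X: vertex -> its variable space
    utilities: List[Callable[[Instantiation], float]]   # u: one utility function per player
    decision_node: List[str]                            # R: player i -> the node player i sets
    chance_cpds: Dict[str, Callable[[Instantiation], Dict[object, float]]]  # pi

    def chance_nodes(self) -> List[str]:
        """Vertices in the (N + 1)th partition: everything no player controls."""
        controlled = set(self.decision_node)
        return [v for v in self.parents if v not in controlled]

    def validate(self) -> None:
        """Check the bookkeeping constraints stated in the definition (not acyclicity)."""
        if len(self.utilities) != len(self.decision_node):
            raise ValueError("need one utility function and one decision node per player")
        if len(set(self.decision_node)) != len(self.decision_node):
            raise ValueError("each player must control a distinct decision node")
        if not set(self.decision_node) <= set(self.parents):
            raise ValueError("decision nodes must be vertices of the graph")
        if set(self.parents) != set(self.spaces):
            raise ValueError("every vertex needs a variable space")
        if set(self.chance_cpds) != set(self.chance_nodes()):
            raise ValueError("CPDs must be given for chance nodes and only for them")


# A normal-form game is the special case with no edges:
# two decision nodes, no chance nodes, hence no prespecified CPDs.
matching_pennies = SemiNetFormGame(
    parents={"A": [], "B": []},
    spaces={"A": ("H", "T"), "B": ("H", "T")},
    utilities=[
        lambda x: 1.0 if x["A"] == x["B"] else -1.0,
        lambda x: -1.0 if x["A"] == x["B"] else 1.0,
    ],
    decision_node=["A", "B"],
    chance_cpds={},
)
matching_pennies.validate()
```

The decision nodes deliberately carry no distribution here; supplying one is the job of the solution concept.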



2.2 Solution Concept: Level-K D-Relaxed Strategies 

In order to make meaningful predictions of the outcomes of the games, we must 
solve for the strategies of the players by converting the utility function at each deci- 
sion node into a conditional probability distribution over that node. This is accom- 
plished using a set of formal rules and assumptions applied to the players called a 
solution concept. A number of solution concepts have been proposed in the game 
theory literature, many of which show promise in modeling real human behavior in
game theory experiments, such as level-K thinking, quantal response equilibrium,
and cognitive hierarchy. Although this work uses level-K exclusively, we are by
no means wedded to this solution concept. In fact, semi net-form games can
be adapted to use other models, such as Nash equilibrium, quantal response equilibrium,
quantal level-K, and cognitive hierarchy. Studies [5, 43] have found that
the performance of a solution concept varies a fair amount depending on the game.
Thus it may be wise to use different solution concepts for different problems.

Level-K thinking [11] is a game theoretic solution concept used to predict the
outcome of human-human interactions. A number of studies have
shown promising results predicting experimental data in games using this method.
The concept of level-K is defined recursively as follows. A level K player plays
(picks his action) as though all other players are playing at level K - 1, who, in
turn, play as though all other players are playing at level K - 2, etc. This process
continues until level 0 is reached, where the player plays according to a prespecified
prior distribution. Notice that running this process for a player with K ≥ 2 results in
ricocheting between players. For example, if player A is a level 2 player, he plays
as though player B is a level 1 player, who in turn plays as though player A is a
level 0 player. Note that player B in this example may not be a level 1 player in
reality; player A only assumes him to be one during his reasoning process. Since
this ricocheting process between levels takes place entirely in the player's mind,
no wall clock time is counted (we do not consider the time it takes for a human
to run through his reasoning process). We do not claim that humans actually think
in this manner, but rather that this process serves as a good model for predicting
the outcome of interactions at the aggregate level. In most games, the player's level
K is a fairly low number for humans; experimental studies [5] have found K to be
somewhere between 1 and 2.
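
The recursive definition can be made concrete with a minimal sketch for a two-player matrix game. This is an illustration under stated assumptions, not the authors' code: the function name `level_k_strategy`, the payoff matrices, and the uniform level 0 distributions are all hypothetical, and the best response is taken to be deterministic.

```python
def level_k_strategy(k, player, payoffs, level0):
    """Action distribution of `player` (0 or 1) when reasoning at level k.

    payoffs[p][a0][a1] is player p's payoff when player 0 plays a0 and player 1 plays a1.
    level0[p] is the prespecified level 0 distribution of player p.
    """
    if k == 0:
        return list(level0[player])
    # The opponent is imagined to play at level k - 1 (the "ricochet").
    opp = level_k_strategy(k - 1, 1 - player, payoffs, level0)
    n_own = len(level0[player])

    def payoff(own_action, opp_action):
        # Reorder the pair so that player 0's action always indexes first.
        a0, a1 = (own_action, opp_action) if player == 0 else (opp_action, own_action)
        return payoffs[player][a0][a1]

    # Expected payoff of each own action against the imagined opponent.
    expected = [sum(p * payoff(a, b) for b, p in enumerate(opp)) for a in range(n_own)]
    best = max(range(n_own), key=expected.__getitem__)
    return [1.0 if a == best else 0.0 for a in range(n_own)]


# Hypothetical 2x2 game: player A is index 0, player B is index 1.
payoffs = [
    [[3, 0], [0, 1]],   # payoffs to A
    [[0, 2], [1, 0]],   # payoffs to B
]
level0 = [[0.5, 0.5], [0.5, 0.5]]
level2_A = level_k_strategy(2, 0, payoffs, level0)
```

In this toy game a level 1 player A picks action 0, whereas a level 2 player A picks action 1, because he reasons against a level 1 player B rather than against the uniform level 0 prior.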

In [24], the authors propose a novel solution concept called "level-K d-relaxed
strategies" that adapts the traditional level-K concept to semi network-form games.
The algorithm proceeds as follows. To form the best response of a decision node
$v$, the associated player $i = R^{-1}(v)$ will want to calculate quantities of the form
$\arg\max_{x_v} \left[ E(u_i \mid x_v, x_{pa(v)}) \right]$, where $u_i$ is the player's utility, $x_v$ is the variable set by
the player (i.e., his move), and $x_{pa(v)}$ is the realization of his parents that he observes.
We hypothesize that he (behaves as though he) approximates this calculation in several
steps. First, he samples $M$ candidate moves from a "satisficing" distribution (a
prior distribution over his moves). Then, for each candidate move, he estimates the
expected utility resulting from playing that move by sampling $M'$ times the posterior
probability distribution over the entire Bayes net given his parents and his actions
(which accounts for what he knows and controls), and computing the sample expectation
of $u_i$. Decision nodes of other players are assumed to be playing at a fixed
conditional probability distribution computed at level $K - 1$. Finally, the player picks
the move that has the highest estimated expected utility. In other words, the player
performs a finite-sample inference of his utility function using the information available
to him, then picks (out of a subset of all his moves) the move that yields the
highest expected utility. For better computational performance, the algorithm reuses
certain sample sets by exploiting the d-separation property of Bayes nets [21]. The
solution concept was used to model pilot behavior in a mid-air encounter scenario, 
and showed reasonable behavioral results. 
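
The sampling steps described above can be sketched as follows. This is a simplified illustration, not the algorithm of [24]: it omits the d-separation sample reuse entirely, and the callables `sample_move` and `sample_utility` are hypothetical stand-ins for drawing from the satisficing distribution and for sampling the player's utility from the net with the other players fixed at level K - 1.

```python
import random
from statistics import fmean


def relaxed_best_response(observed, sample_move, sample_utility, m, m_prime, rng):
    """Pick the best of m sampled candidate moves using m_prime utility samples each.

    sample_move(rng) draws one move from the satisficing distribution.
    sample_utility(observed, move, rng) draws one utility sample conditioned on the
    observed parents and the candidate move.
    """
    candidates = [sample_move(rng) for _ in range(m)]

    def estimate(move):
        # Finite-sample estimate of the expected utility of this move.
        return fmean([sample_utility(observed, move, rng) for _ in range(m_prime)])

    estimates = [estimate(move) for move in candidates]
    best_index = max(range(m), key=estimates.__getitem__)
    return candidates[best_index]


# Toy usage with assumed ingredients: moves 0, 1, 2 are proposed in turn (a
# deterministic stand-in for the satisficing distribution), and utility is
# highest when the move matches the observation, up to bounded noise.
rng = random.Random(0)
proposals = iter([0, 1, 2] * 2)
best = relaxed_best_response(
    observed=2,
    sample_move=lambda rng: next(proposals),
    sample_utility=lambda obs, move, rng: -(move - obs) ** 2 + rng.uniform(-0.1, 0.1),
    m=6,
    m_prime=50,
    rng=rng,
)
```

Because the player only compares the M sampled candidates, a good move that is never proposed can never be chosen; this is the sense in which the strategy is "relaxed" rather than fully optimal.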



3 Iterated Semi Network-Form Games 



In the previous section, we described a method to model a single-shot scenario,
that is, a scenario in which each player makes a single decision. However, most
real-world scenarios are not single-shot. Rather, what is typically seen is that the
outcome is determined by a series of decisions made by each player over a time-repeated
structure. One way to model time extension is to ignore the structure, create
a large "rolled-out" net^ that explicitly enumerates the repeated nodes, then apply
level-K d-relaxed strategies described in Section 2.2. The problem with such an
approach is that the roll-out causes the number of decision
nodes to grow linearly with the number of time steps. Since the computational complexity of level-K
d-relaxed strategies is polynomial (to the $K$th power) in the number of decision
nodes [24], the algorithm becomes prohibitively slow in solving scenarios with more
than a few time steps.

In this section, we extend the semi network-form game from Section 2 to an
"iterated semi network-form game" (or iterated semi net-form game) in order to ex- 
plicitly model the repeated-time structure of the game. Then we introduce a novel 
solution concept called "level-K reinforcement learning" that adapts level-K think- 
ing to the iterated semi network-form game setting. 



3.1 Construction of an Iterated Semi Network-Form Game 

We describe the extended framework by building up the components incrementally. 
A "semi Bayes net" is like a standard Bayes net, in that a semi Bayes net has a topol- 
ogy specified by a set of vertices and directed edges, and variable spaces that define 
the possible values each vertex can take on. However, unlike a standard Bayes net, 
some nodes have conditional probability distributions (CPDs) specified, whereas 
some do not. The nodes that do not have their CPDs specified are decision nodes 
with one node assigned to each player. A pictorial example of a semi Bayes net is
shown in Figure 1(a). The dependencies between variables are represented by directed
edges. The oval nodes are chance nodes and have their CPDs specified; the
rectangular nodes are decision nodes and have their CPDs unspecified. In this paper, 
the unspecified distributions will be set by the interacting players and are specified 
by the solution concept. 

^ Here we are violating the definition of a semi net-form game that each player can only control a
single decision node. One way to deal with this is to treat past and future selves as different players,
but having the same utility function.

We create two types of semi Bayes nets: a "base semi Bayes net" and a "kernel
semi Bayes net". A "base semi Bayes net" specifies the information available to all
the players at the start of play, and is where the policy decisions of the game are
made. Note that even though the game is time-extended, players only ever make
one real decision. This decision concerns which policy to play, and it is made at
the beginning of the game in the base semi Bayes net. After the policy decision is
made, action decisions are merely the result of evaluating the policy at the current
state. In contrast, the "kernel semi Bayes net" specifies both how information from
the past proceeds to future instances of the players during play, and how the state of
nature evolves during play. In particular, it specifies not only what a player currently
observes, but also what they remember from their past observations and past actions.
For example, the kernel semi Bayes net describes how the policy chosen in the base
semi Bayes net is propagated to a player's future decision nodes, where a player's
action choices are merely the result of evaluating that policy. From these two, we
construct an "iterated semi Bayes net" by starting with the base semi Bayes net
then repeatedly appending the kernel semi Bayes net to it T times. Each append
operation uses a "gluing" procedure that merges nodes from the first semi Bayes net
with root nodes that have the same spaces in the second semi Bayes net. Figure 1 illustrates
how we build up an iterated semi Bayes net with a base net and two kernels, i.e.,
T = 2. Finally, we create an "iterated semi net-form game" by endowing an iterated
semi Bayes net with a reward function, one for each player, defined at each time
instant. The reward function takes as input an instantiation of the net at a particular
(discrete) time and outputs a reward metric representing how happy the player is
with that instantiation.^

Fig. 1 Example construction of an iterated semi Bayes net with a base net and two kernels, i.e.,
T = 2, by repeatedly applying the "gluing" procedure. (a) A base semi Bayes net. (b) A kernel
semi Bayes net being "glued" to a base semi Bayes net. (c) A second kernel semi Bayes net being
appended to the net. (d) The final iterated semi Bayes net with T = 2. The numeric subscript
indicates the time step to which each variable belongs.
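
The construction of an iterated semi Bayes net from a base net and T copies of a kernel can be sketched in a few lines. The representation below is an assumption made for illustration: instead of explicit root nodes that are merged by gluing, each kernel parent carries a lag (0 for the same time step, 1 for the previous one), so a lag-1 parent plays the role of a kernel root node glued onto the preceding step. The function name and the node names `state`, `obs`, and `move` are hypothetical.

```python
def build_iterated_net(base, kernel, T):
    """Return the rolled-out topology as a dict: node name -> list of parent names.

    base:   dict node -> list of parent nodes (all at time step 0).
    kernel: dict node -> list of (parent, lag) pairs, lag in {0, 1}.
    Nodes in the result carry a numeric time-step suffix, e.g. "state_2".
    """
    net = {f"{v}_0": [f"{p}_0" for p in ps] for v, ps in base.items()}
    for t in range(1, T + 1):
        for v, ps in kernel.items():
            # A lag-1 parent refers to the node of the previous step: this is the glue.
            net[f"{v}_{t}"] = [f"{p}_{t - lag}" for p, lag in ps]
    return net


# Hypothetical single-player example: a state, an observation of it, and a move.
base = {"state": [], "obs": ["state"], "move": ["obs"]}
kernel = {
    "state": [("state", 1), ("move", 1)],   # nature evolves from the last state and move
    "obs": [("state", 0)],                  # the player observes the current state
    "move": [("obs", 0), ("move", 1)],      # the policy is carried over from the last step
}
net = build_iterated_net(base, kernel, T=2)
```

The edge from `move` at the previous step to `move` at the current step is how this sketch represents a policy chosen once in the base net being passed unchanged through every kernel.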



3.2 Solution Concept: Level-K Reinforcement Learning 

We introduce a novel solution concept for iterated semi net-form games that com- 
bines level-K thinking and reinforcement learning. Instead of considering all possi- 
ble combinations of actions at individual decision nodes, we simplify the decision 
space by assuming that the players make only a single decision - what policy to play 
for the rest of the net. That is, the players pick a policy in the base semi Bayes net, 
and then execute that policy over all repetitions of the kernel semi Bayes net. This
assumption allows us to convert the problem of computing a combination of actions 
over all time steps to one where we calculate a player's policy only once and reuse 
it T times. By reusing the policy, the computational complexity becomes independent
of the total number of time steps. Formally, each unspecified node of a player
contains two parts: a policy and an action. The policy is chosen in the base stage
and is passed unchanged from the player's node in the base semi Bayes net to the 
player's node in the kernel semi Bayes net for all time steps. At each time step, the 
action component of the node is sampled from the policy based on the actual values 
of the node's parents. We point out that the utility of a particular policy depends on 
the policy decisions of other players because the reward functions of both players 
depend on the variables in the net. 

^ We use the term reward function to conform to the language used in the RL literature. This is
identical to the game theoretic notion of instantaneous utility (as opposed to the total utility, i.e.,
the present discounted value of instantaneous utilities).

The manner in which players make decisions given this coupling is specified
by the solution concept. In this work we handle the interaction between players by
extending standard level-K thinking from action space to policy space. That is, instead
of choosing the best level K action (assuming other players are choosing the
best level K - 1 action), players choose the best level K policy (assuming that other
players choose their best level K - 1 policy). Instead of prespecifying a level 0 distribution
over actions, we now specify a level 0 distribution over policies. Notice that
from the perspective of a level K player, the behavior of the level K - 1 opponents is
identical to a chance node. Thus, to the player deciding his policy, the other players
are just a part of his environment. Now what remains to be done is to calculate the
best response policy of the player. In level-K reinforcement learning, we choose the
utility of a player to be the discounted sum of his rewards from each time step. In other words,
the player selects the policy which leads to the highest expected infinite sum of discounted
rewards. Noting this together with the fact that the actions of other players
are identical to a stochastic environment, we see that the optimization is the same as
a single-agent reinforcement learning problem where an agent must maximize his
reward by observing his environment and choosing appropriate actions. There are
many standard reinforcement learning techniques that can be used to solve such a
problem [3, 18, 38]. The techniques we use in this paper are described in detail in
Section 4.

For example, in a two-player iterated semi network-form game, the level 1 policy
of player A is trained using reinforcement learning by assuming an environment that
includes a player B playing a level 0 policy. If A is instead at level 2, his environment
includes player B using a level 1 policy. Player A imagines this level 1 policy as
having been reinforcement learned against a level 0 player A. To save computation
time, it is assumed that how player B learns his level 1 distribution and how A
imagines B to learn his level 1 distribution are identical.
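
A minimal sketch of this recursion, using tabular Q-learning as the single-agent RL method, is given below. Everything specific in it is an assumption made for illustration: the toy attacker/defender rewards, the two-valued state that is redrawn at random each step, the deterministic "always idle" level 0 policies, and the learning parameters. It is not the RL algorithm or the power grid model used later in this paper; it only shows how a level K - 1 opponent is folded into the environment of a level K learner.

```python
import random

STATES = (0, 1)    # toy observable state, standing in for the player's memory
ACTIONS = (0, 1)   # 0 = idle, 1 = attack (attacker) or guard (defender)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2   # assumed learning rate, discount, exploration


def reward(role, own, other):
    """Hypothetical per-step rewards for the toy attacker/defender game."""
    if role == "attacker":
        if own == 0:
            return 0.0
        return -0.5 if other == 1 else 1.0   # an attack is blocked if the defender guards
    guard_cost = -0.2 * own
    damage = -1.0 if (other == 1 and own == 0) else 0.0
    return guard_cost + damage


def level_k_policy(level, role, steps, rng):
    """Greedy policy (state -> action) of `role` reasoning at the given level."""
    if level == 0:
        return {s: 0 for s in STATES}   # assumed level 0 policy: always idle
    other_role = "defender" if role == "attacker" else "attacker"
    # The opponent's level K - 1 policy is fixed, so it is simply part of the environment.
    opponent = level_k_policy(level - 1, other_role, steps, rng)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    state = rng.choice(STATES)
    for _ in range(steps):
        if rng.random() < EPSILON:
            action = rng.choice(ACTIONS)                        # explore
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
        r = reward(role, action, opponent[state])
        next_state = rng.choice(STATES)   # toy dynamics: the state is redrawn at random
        target = r + GAMMA * max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += ALPHA * (target - q[(state, action)])
        state = next_state
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}


rng = random.Random(0)
level1_attacker = level_k_policy(1, "attacker", 20000, rng)
level2_defender = level_k_policy(2, "defender", 20000, rng)
```

Each level is trained against a frozen opponent, so the policies for different players and levels can be computed independently of the opponents' actual play. As in the text, the level K - 1 opponent is retrained inside the call rather than shared, which corresponds to assuming that the imagined and the actual learning processes are identical.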



4 Application: Cyber-Physical Security of a Power Network 
4.1 Introduction 

We test our iterated semi net-form game modeling concept on a simplified model 
of an electrical power grid controlled by a Supervisory Control and Data Acquisition
(SCADA) system [39]. A SCADA system forms the cyber and communication
components of many critical cyber-physical infrastructures, e.g., electrical power
grids, chemical and nuclear plants, transportation systems, and water systems. Hu- 
man operators use SCADA systems to receive data from and send control signals to 
physical devices such as circuit breakers and power generators in the electrical grid. 
These signals cause physical changes in the infrastructure such as ramping elec- 
trical power generation levels to maintain grid stability or modifying the electrical 
grid's topology to maintain the grid's resilience to random component failures. If 
a SCADA system is compromised by a cyber attack, the human attacker may alter 
these control signals with the intention of degrading operations or causing perma- 
nent, widespread damage to the physical infrastructure. 

The increasing connection of SCADA to other cyber systems and the use of computer
systems for SCADA platforms are creating new vulnerabilities of SCADA to
cyber attack [1]. These vulnerabilities increase the likelihood that the SCADA systems
can and will be penetrated. However, even when a human attacker has gained
some control over the physical components, the human operators still have some 
SCADA observation and control capability. The operators can use this capability to 
anticipate and counter the attacker moves to limit or deny the damage and maintain 
continuity of the infrastructure's operation. Traditional cyber security research on
cyber systems has focused on identifying vulnerabilities and how to mitigate those
vulnerabilities. Here, instead, we assume that an attacker has penetrated the system, 
and we want to predict the outcome. 

The SCADA attack and the defense by the SCADA operator can be modeled 
as a machine-mediated, human-human adversarial game. In the remainder of this 
section, we construct an iterated semi network-form game to model just such an 
interaction taking place over a simplified model of a SCADA-controlled electrical 
grid. The game is simulated using the level-K reinforcement learning solution con- 
cept described earlier. We explore how the strategic thinking embodied in level-K 
reinforcement learning affects the player performance and outcomes between play- 
ers of different level K. 



4.2 Scenario Model 

Figure 2 shows a schematic of our simplified electrical grid infrastructure. It consists
of a single, radial distribution circuit [40] starting at the low-voltage side of a
transformer at a substation (node 1) and serving customers at nodes 2 and 3. Node 2
represents an aggregation of small consumer loads distributed along the circuit; such
load aggregation is often done to reduce model complexity when simulating electri- 
cal distribution systems. Node 3 represents a relatively large, individually-modeled 
distributed generator located near the end of the circuit. 



Fig. 2 Schematic drawing of the three-node distribution circuit with nodes i. The
voltage at each node is Vi; the real and reactive power injections are pi and qi,
respectively; the line reactance and resistance are xi and ri, respectively; and
the real and reactive power flows in the distribution lines are Pi and Qi,
respectively.

In this figure, Vi, pi, and qi are the voltage and the real and reactive power
injections at node i. Pi, Qi, ri, and xi are the real power flow, reactive power
flow, resistance, and reactance of circuit segment i. These quasi-static power
injections, power flows, voltages, and line properties are related by the
nonlinear AC power flow equations [23]. Because our focus in this work is on the
game theoretic aspects of the model, we use a linearized description of the
electrical power flow, i.e., the LinDistFlow equations [40]

$$P_2 = -p_3, \quad Q_2 = -q_3, \quad P_1 = P_2 + p_2, \quad Q_1 = Q_2 + q_2, \tag{1}$$

$$V_2 = V_1 - (r_1 P_1 + x_1 Q_1), \quad V_3 = V_2 - (r_2 P_2 + x_2 Q_2). \tag{2}$$

Here, all terms have been normalized by the nominal system voltage V0 [23].

In this model, we assume that the circuit configuration is constant with ri = 0.03
and xi = 0.03. To emulate the normal fluctuations of consumer real load, p2 is drawn
from a uniform distribution over the range [1.35, 1.5] at each time step of the game.
The consumer reactive power is assumed to scale with real power, and we take
q2 = 0.5 p2 at each step of the game. The node 3 real power injection p3 = 1 is
also taken as constant, implying that, although the distributed generator at node 3
is controllable (as opposed to a fluctuating renewable generator), its output has
been fixed. Node 3 is then a reasonable model of an internal combustion
engine/generator set burning diesel or perhaps methane derived from landfill gas.
Such distributed generation is becoming more common in electrical distribution
systems.
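As a rough illustration of how the linearized power flow determines the node voltages, the following Python sketch evaluates the LinDistFlow relations for the three-node circuit. The function name `lindistflow` and its argument names are hypothetical. The flow-balance relations used here (P2 = -p3, Q2 = -q3, P1 = P2 + p2, Q1 = Q2 + q2) are an assumed sign convention in which p2 and q2 are loads and p3 and q3 are injections.

```python
def lindistflow(v1, p2, q2, p3, q3, r=0.03, x=0.03):
    """Return (V2, V3) for the three-node radial circuit.

    Assumed sign convention: p2, q2 are loads at node 2; p3, q3 are
    injections at node 3. All quantities are normalized (per unit).
    """
    # Flows on segment 2 (node 2 -> node 3) supply the net load at node 3.
    P2 = -p3
    Q2 = -q3
    # Flows on segment 1 (node 1 -> node 2) supply node 2 plus segment 2.
    P1 = P2 + p2
    Q1 = Q2 + q2
    # Linearized voltage drops along each segment.
    v2 = v1 - (r * P1 + x * Q1)
    v3 = v2 - (r * P2 + x * Q2)
    return v2, v3


# Example with mid-range load, q2 = 0.5 * p2, p3 = 1, and no reactive injection.
v2, v3 = lindistflow(v1=1.0, p2=1.4, q2=0.7, p3=1.0, q3=0.0)
```

In this sketch a positive q3 raises both V2 and V3, which is the lever the attacker uses in the game described below.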

In our simplified game, the SCADA operator (defender) has one objective, i.e.,
keeping the voltages V2 and V3 within appropriate operating bounds (described in
more detail below). To accomplish this, the operator normally has two controls: 1)
he can change the voltage V1 at the head of the circuit, and 2) he can adjust the
reactive power output q3 of the distributed generator at node 3. However, we assume
that the system has been compromised, and the attacker has taken control of q3
while the defender retains control of V1. In this circumstance, the attacker may
use the injection of reactive power q3 to modify all the Qi, causing the voltage V2 to
deviate significantly from 1.0. Excessive deviation of V2 or V3 can damage customer
equipment [23] or perhaps initiate a cascading failure beyond the circuit in question.
In the language of an iterated semi network-form game, the change in V1 is the
decision variable of the defender, q3 is the decision variable of the attacker, and V2,
V3, and the rest of the system state are determined by the LinDistFlow equations
and probability distribution described above.

4.2.1 Players' Decision Spaces 

In this scenario, the defender maintains control of V1, which he can adjust in
discrete steps via a variable-tap transformer [23]. However, hardware-imposed limits
constrain the defender's actions at time t to the following domain

$$D_{D,t} = \{\min(v_{\max}, V_{1,t} + \delta v),\; V_{1,t},\; \max(v_{\min}, V_{1,t} - \delta v)\}, \tag{3}$$

where δv is the voltage step size for the transformer, and vmin and vmax represent
the absolute minimum and maximum voltage the transformer can produce. In simple
terms, the defender may leave V1 unchanged or move it up or down by δv as long as
V1 stays within the range [vmin, vmax]. In our model, we take vmin = 0.90,
vmax = 1.10, and δv = 0.02. Similarly, hardware limitations of the generator at
node 3 constrain the attacker's range of control of q3. In reality, the maximum and
minimum values of q3 can be a complicated function [23] of the maximum real power
generation capability p3,max and the actual generation level p3. To keep the focus
on the game theoretic aspects of the model, we simplify this dependence by taking
the attacker's q3 control domain to be

$$D_{A,t} = \{-q_{3,\max}, \ldots, 0, \ldots, q_{3,\max}\}, \tag{4}$$

with q3,max = p3,max. To reduce the complexity of the reinforcement learning
computations, we also discretize the attacker's move space to eleven equally spaced
settings with -q3,max and +q3,max as the end points. Later, we study how the behavior
and performance of the attacker depend on the size of the assets under his control
by varying p3,max from 0 to 1.8.
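The two move spaces can be sketched in a few lines of Python. This is a minimal illustration under the parameter values quoted above (δv = 0.02, voltage limits 0.90 and 1.10, eleven attacker settings); the function names `defender_actions` and `attacker_actions` are hypothetical.

```python
def defender_actions(v1, dv=0.02, v_min=0.90, v_max=1.10):
    """Candidate values of V1 at the next step: one tap up, hold, one tap down."""
    return [min(v_max, v1 + dv), v1, max(v_min, v1 - dv)]


def attacker_actions(q3_max, n_settings=11):
    """Equally spaced q3 settings from -q3_max to +q3_max (inclusive)."""
    step = 2.0 * q3_max / (n_settings - 1)
    return [-q3_max + i * step for i in range(n_settings)]


# Example: at the upper tap limit the "up" move collapses onto "hold".
moves_at_limit = defender_actions(1.10)
settings = attacker_actions(1.0)
```

Clipping with `min` and `max` mirrors the hardware limits: at a limit, the defender effectively has only two distinct moves.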



4.2.2 Players' Observed Spaces 

The defender and attacker make observations of the system state via the SCADA
system and the attacker's compromise of node 3, respectively. Via the SCADA system,
the defender retains wide system visibility of the variables important to his
operation of the system, i.e., the defender's observed space is given by

$$\Omega_D = [V_1, V_2, V_3, P_1, Q_1, M_D]. \tag{5}$$

Because he does not have access to the full SCADA system, the attacker's observed
space is somewhat more limited

$$\Omega_A = [V_2, V_3, p_3, q_3, M_A]. \tag{6}$$

Here, MD and MA each denote real numbers that represent a handcrafted summary
metric of the respective player's memory of the past events in the game. These are
described in Section 4.2.4.



4.2.3 Players' Rewards 

The defender desires to maintain a high quality of service by controlling the voltages 
V2 and V3 near the desired normalized voltage of 1 .0. In contrast, the attacker wishes 
to damage equipment at node 2 by forcing V2 beyond normal operating limits. Both 
the defender and attacker manipulate their controls in an attempt to maximize their 
own average reward, expressed through the following reward functions 

$$R_D = -\left(\frac{V_2 - 1}{\epsilon}\right)^2 - \left(\frac{V_3 - 1}{\epsilon}\right)^2, \tag{7}$$



$$R_A = \Theta(V_2 - (1 + \epsilon)) + \Theta((1 - \epsilon) - V_2). \tag{8}$$

Here, ε represents the half-width of the nominally good range of normalized voltage.
For most distribution systems under consideration, ε ≈ 0.05. Θ(·) is the step function.
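The single-step rewards can be illustrated with a short Python sketch. The attacker reward follows the step-function form of Equation (8). The defender reward is written as a sum of squared voltage deviations scaled by ε, which is an assumed quadratic form consistent with the defender's stated goal of holding V2 and V3 near 1.0. The function names, and the choice that the step function is zero exactly at the band edge, are also assumptions.

```python
def attacker_reward(v2, eps=0.05):
    """1.0 when V2 lies outside [1 - eps, 1 + eps], else 0.0."""
    return float(v2 > 1.0 + eps) + float(v2 < 1.0 - eps)


def defender_reward(v2, v3, eps=0.05):
    """Assumed quadratic penalty on the deviations of V2 and V3 from 1.0."""
    return -((v2 - 1.0) / eps) ** 2 - ((v3 - 1.0) / eps) ** 2


# Example: a voltage of 0.93 is outside the nominal band, so the attacker scores.
r_attack = attacker_reward(0.93)
r_defend = defender_reward(0.93, 0.96)
```

Note the asymmetry: the attacker is rewarded only for band violations at node 2, while the defender is penalized continuously for any deviation at either node.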

4.2.4 Players' Memory Summary Metrics 

The defender and attacker use memory of the system evolution in an attempt to 
estimate part of the state that is not directly observable. In principle, player memo- 
ries should be constructed based on specific application domain knowledge or inter- 
views with such experts. However, in this initial work, we simply propose a memory 
summary metric for each player that potentially provides him with additional, yet 
imperfect, system information. We define the defender memory summary metric to 
be 



If the attacker has very limited q3 capability, both p3 and q3 are relatively constant,
and changes in V3 should follow changes in V1, which is directly controlled by the
defender. If all V3 changes are as expected, then MD = 1. The correlation between
V1 and V3 changes can be broken by an attacker with high q3 capability because
large changes in q3 can make V1 and V3 move in opposite directions. If attacker
actions always cause V1 and V3 to move in opposite directions, then MD = -1. This
correlation can also be broken by variability in the (unobserved) p2 and q2. The
attacker could use this (p2, q2) variability, which is unobserved by the attacker, to
mask his actions at node 3. Such masking is more important in a setting where the
defender is uncertain of the presence of the attacker, which we will address in future
work.

As with the defender memory summary metric, the intent of MA is to estimate
some unobserved part of the state. Perhaps the most important unobserved state
variable for the attacker is V1, which reveals the vulnerability of the defender and
would be extremely valuable information for the attacker. If the attacker knows the
rules that the defender must follow, i.e., Equation (3), he can use his observations to
infer V1. One mathematical construct that provides this inference is



If the attacker increases q3 by Δq3,t = q3,t - q3,t-1, he would expect a proportional
increase in V3 of ΔV3,t = V3,t - V3,t-1 ≈ Δq3,t x2/V0. If V3 changes according to this
reasoning, then the argument in MA is zero. However, if the defender adjusts V1 at
the same time step, the change in V3 would be modified. If ΔV3,t is greater or lower
than the value expected by the attacker by ΔV/N, the argument in MA is +1 or -1,
respectively. The sum then keeps track of the net change in V1 over the previous m
time steps. Note that the stochastic load (p2, q2) will also cause changes in V3
and, if large enough, can effectively mask the defender behavior from the attacker.




We model the scenario described in Section 4.2 as an iterated semi net-form game
set in the graph shown in Figure 3. The figure shows the net for 2 time steps with
the numeric subscript on each variable denoting the time step to which it belongs.
The system state S = [P2, Q2, P1, Q1, V2, V3] is a vector that represents the current
state of the power grid network. The vector comprises key system variables
with their relationships defined in Equations (1) and (2). The observation nodes
OD = [V1, V2, V3, P1, Q1] and OA = [V2, V3, p3, q3] are vectors representing the part
of the system state that is observed by the defender and attacker, respectively. We
compute these observation nodes by taking the system state S and passing through
unchanged only the variables that the player observes. Each player's observation is
incorporated into a memory node ($\mathbf{M}_D$ and $\mathbf{M}_A$ for the defender
and attacker, respectively) that summarizes information from the player's past and
present. The memory nodes* are given by
$\mathbf{M}_{D,t} = [O_{D,t}, M_{D,t}, D_{D,t-1}]$ and
$\mathbf{M}_{A,t} = [O_{A,t}, M_{A,t}, D_{A,t-1}]$, where the non-bold MD,t and MA,t
are the memory summary metrics of Section 4.2.4. Now, the defender uses his memory
$\mathbf{M}_D$ to set the decision node DD, which adjusts the setting of the
voltage-tap transformer (up to one increment in either direction) and sets the
voltage V1. On the other hand, the attacker uses his memory $\mathbf{M}_A$ to set
the decision node DA, which sets q3. Finally, the decisions of the players are
propagated to the following time step to evolve the system state. In our experiments
we repeat this process for T = 100 time steps.



4.4 Computing the Solution Concept 

We compute the level-K policies of the players following the level-K reinforcement
learning solution concept described in Section 3.2. First, we form the base of the
level-K hierarchy by defining level 0 policies for the defender and attacker. Then,
we describe the details of how we apply reinforcement learning to bootstrap up to
levels K > 0. A level 0 policy represents a prior on the player's policy, i.e., it defines
how a non-strategic player would play. In this work, we handcrafted level 0 policies
based on expert knowledge of the domain. In future work, we would like to devise
an automated and somewhat "principled" way of setting the level 0 policies.
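The bootstrapping order can be sketched as a small loop: each level-K policy is trained against the opponent's level K - 1 policy, starting from the handcrafted level 0 policies. The sketch below is only a schematic under that reading; `build_level_k_policies` and the `train` callback (standing in for a full reinforcement learning run) are hypothetical names.

```python
def build_level_k_policies(level0_defender, level0_attacker, train, max_level=2):
    """Build defender and attacker policies for levels 0..max_level.

    `train(role, opponent)` is a placeholder for a single-agent RL run that
    returns a policy for `role` against a fixed `opponent` policy.
    """
    defenders = {0: level0_defender}
    attackers = {0: level0_attacker}
    for k in range(1, max_level + 1):
        # Level k of each player best-responds to level k - 1 of the other.
        defenders[k] = train("defender", opponent=attackers[k - 1])
        attackers[k] = train("attacker", opponent=defenders[k - 1])
    return defenders, attackers
```

With `max_level=2`, the four training runs correspond to a level 1 and a level 2 policy for each player, each trained against an opponent that is held fixed during training.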



4.4.1 Level 0 Policies

Often, level 0 players are assumed to choose their moves randomly from their move
spaces DD,t and DA,t. However, we do not believe this to be a good assumption,
especially for SCADA operators. These operators have training which influences how


* To be technically correct, we must also include the variables carried by the memory nodes
$\mathbf{M}_D$ and $\mathbf{M}_A$ for the sole purpose of calculating MD and MA,
respectively. However, for simplicity, we are not showing these variables explicitly.







Fig. 3 The iterated semi net-form game graph of the cyber security of a smart power network
scenario. The graph shows 2 time steps explicitly. In our experiments we choose the number of 
time steps T = 100. We use subscripts D and A to denote node association with the defender 
and attacker, respectively, and the numeric subscript to denote the time step. The system state S 
represents the current state of the power grid network. The players make partial observations O of 
the system and use them to update their memories M. The memories are used to pick their action 
D. 



they control the system when no attacker is present, i.e., the "normal" state. In
contrast, a random-move assumption may be a reasonable model for a level 0 attacker
that has more knowledge of cyber intrusion than of manipulation of the electrical
grid. However, we assume that our level 0 attacker also has some knowledge of the
electrical grid.

If there is no attacker present on the SCADA system, the defender can maximize
his reward by adjusting V1 to move the average of V2 and V3 closer to 1.0 without
any concern for what may happen in the future. We take this myopic behavior as
representative of the level 0 defender, i.e.,

$$\pi_D(V_{2,t}, V_{3,t}) = \arg\min_{D_{D,t}} \left| \frac{V_{2,t} + V_{3,t}}{2} - 1 \right|.$$

For the level 0 attacker, we adopt a drift-and-strike policy, which requires some
knowledge of the physical circuit and power flow equations. We propose that the
attacker "drifts" in one direction by steadily increasing (or decreasing) q3 by one
increment at each time step. The level 0 attacker decides the direction of the drift
based on V2, i.e., the attacker drifts to larger q3 if V2 < 1. The choice of V2 to decide
the direction of the drift is somewhat arbitrary. However, this is simply assumed
level 0 attacker behavior. The drift in q3 causes a drift in Q1 and, without any
compensating move by the defender, a drift in V2. However, a level 0 defender
compensates by drifting V1 in the opposite sense as V2 in order to keep the average
of V2 and V3 close to 1.0. The level 0 attacker continues this slow drift until, based
on his knowledge of the power flow equations and the physical circuit, he detects
that a sudden large change in q3 in the opposite direction of the drift would push
V2 outside the range [1 - ε, 1 + ε]. If the deviation of V2 is large enough, it will
take the defender a number of time steps to bring V2 back in range, and the attacker
accumulates reward during this recovery time. More formally, this level 0 attacker
policy can be expressed as
Level0Attacker()
    V* = max_{q ∈ DA,t} |V2 - 1|;
    if V* > θA
        then return arg max_{q ∈ DA,t} |V2 - 1|;
    if V2 < 1
        then return q3,t-1 + 1;
    return q3,t-1 - 1;

Here, V2 inside the maximization is the voltage the attacker projects for the
candidate setting q, ±1 denotes one increment of the discretized q3 setting, and θA
is the threshold parameter that triggers the strike. Throughout this work, we have
used θA = 0.07 > ε to indicate when an attacker strike will accumulate reward.
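The two level 0 policies can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes that the defender expects V2 and V3 to shift by the same amount as V1, and that the attacker projects V2 for a candidate q3 as V2 + x1 (q - q3) with the load and V1 held fixed. The function names, and the clipping of the drift at ±q3,max, are also assumptions.

```python
def level0_defender(v1, v2, v3, dv=0.02, v_min=0.90, v_max=1.10):
    """Myopic defender: pick the V1 move that brings mean(V2, V3) closest to 1."""
    candidates = [min(v_max, v1 + dv), v1, max(v_min, v1 - dv)]
    mean_v = 0.5 * (v2 + v3)
    # Assumption: a change in V1 shifts V2 and V3 by the same amount.
    return min(candidates, key=lambda c: abs(mean_v + (c - v1) - 1.0))


def level0_attacker(v2, q3, q3_max, theta_a=0.07, x1=0.03, n_settings=11):
    """Drift-and-strike attacker returning the next q3 setting."""
    step = 2.0 * q3_max / (n_settings - 1)
    actions = [-q3_max + i * step for i in range(n_settings)]

    def projected_deviation(q):
        # Assumption: V2 responds linearly to q3 with sensitivity x1.
        return abs(v2 + x1 * (q - q3) - 1.0)

    best = max(actions, key=projected_deviation)
    if projected_deviation(best) > theta_a:
        return best                      # strike
    # Otherwise drift by one increment, toward larger q3 when V2 < 1.
    target = q3 + step if v2 < 1.0 else q3 - step
    return max(-q3_max, min(q3_max, target))


# With q3_max = 0.7 the largest projected deviation stays below the threshold,
# so the attacker remains saturated at the end of his drift.
next_q3 = level0_attacker(v2=0.99, q3=0.7, q3_max=0.7)
```

Under these assumptions, a small q3,max never triggers a strike, while a large q3,max lets a single move project V2 well beyond the threshold.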



4.4.2 Reinforcement Learning Details 

The training environment of a level-K player consists of all nodes that he does not
control, including all chance nodes and the decision nodes of other players, which
are assumed to be playing with a level K - 1 policy. This leaves us with a standard
single-agent reinforcement learning problem, where given an observation, the player
must choose an action to maximize some notion of cumulative reward. We loosely
follow the SARSA reinforcement learning setup in [38]. First, we choose the
optimization objective to be the player's expected sum of discounted single-step
rewards (given by Equations (7) and (8)). To reduce the output space of the player,
we impose an ε-greedy parameterization on the player's policy space (this exploration
parameter ε is distinct from the voltage half-width ε of Section 4.2.3). That is, the
player plays what he thinks is the "best" action with probability 1 - ε, and plays
uniformly randomly over all his actions with probability ε. Playing all possible
actions with nonzero probability ensures sufficient exploration of the environment
space for learning. At the core of the SARSA algorithm is learning the "Q-function",
which is a mapping from observations and actions to the expected sum of discounted
rewards (also known as "Q-values"). Given an observation of the system, the
Q-function gives the long-term reward for playing a certain action. To maximize the
reward gathered, the player simply plays the action with the highest Q-value at each
step.
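The ε-greedy rule and the one-step SARSA learning target can be sketched as follows. This is a generic illustration of the technique rather than the authors' code; the function names and the discount factor value are assumptions, and the Q-values are passed in as a plain list (one entry per action) so that the sketch is independent of how the Q-function is represented.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: uniform over all actions with probability
    epsilon, otherwise the index of the highest Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def sarsa_target(reward, q_next, gamma=0.95):
    """One-step SARSA target: r + gamma * Q(o', a'), where a' is the action
    actually selected (on-policy) in the next observation o'."""
    return reward + gamma * q_next


# Example: with epsilon = 0 the choice is purely greedy.
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

The on-policy character of SARSA shows up in `q_next`: it is the Q-value of the action the ε-greedy policy actually takes next, not the maximum over actions.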

To learn the Q-function, we apply the one-step SARSA on-policy algorithm
in [38].† However, since the players' input spaces are continuous variables, we
cannot use a table to store the learned Q-values. For this reason, we approximate the
Q-function using a neural network [3] [34]. Neural networks are a common choice
because of their advantages as universal function approximators and as compact
representations of the policy.

To improve stability and performance, we make the following popular modifications
to the algorithm. First, we run the algorithm in semi-batch mode, where
training updates are gathered during an episode and applied at the end of the episode
rather than following each time step. Second, we promote initial exploration using
optimistic starts (high initial Q-values) and by scheduling the exploration parameter
ε to a high rate of exploration at first, then slowly decreasing it as the training
progresses.
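Both modifications can be illustrated schematically. The sketch below shows one possible exploration schedule (geometric decay toward a floor) and a semi-batch wrapper that defers all updates to the end of the episode. The schedule shape, its numerical values, and the names `epsilon_schedule` and `apply_semibatch` are assumptions made for illustration; the text does not specify them.

```python
def epsilon_schedule(episode, eps_start=0.5, eps_end=0.05, decay=0.99):
    """Exploration rate for a given episode: high at first, then decaying
    geometrically toward a floor (all values are illustrative)."""
    return max(eps_end, eps_start * decay ** episode)


def apply_semibatch(transitions, apply_update):
    """Gather the (observation, action, target) tuples of one episode and
    apply the updates only after the episode has finished."""
    pending = list(transitions)          # gathered during the episode
    for obs, action, target in pending:  # applied at the end of the episode
        apply_update(obs, action, target)
    return len(pending)


# Example: exploration starts at 0.5 and eventually reaches the 0.05 floor.
early, late = epsilon_schedule(0), epsilon_schedule(1000)
```

Deferring the updates means the policy that generates an episode is held fixed for the whole episode, which is one common motivation for batch-style updates.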



4.5 Results and Discussion 

Level-K reinforcement learning was performed for all sequential combinations of
attacker and defender pairings, i.e., D1/A0, D2/A1, A1/D0, and A2/D1. Here, we
refer to a level K player using a shorthand where the letter indicates attacker or
defender and the number indicates the player's level. The pairing of two players
is indicated by a "/". The training was performed for q3,max in the range 0.2 to
1.8. Subsequent to training, simulations were run to assess the performance of the
different player levels. The players' average reward per step for the different pairs is
shown in Figure 4 as a function of q3,max. Figure 5 shows snapshots of the players'
behavior for the pairings D0/A0, D1/A0, and D0/A1 for q3,max = 0.7, 1.2, and 1.6.
Figure 6 shows the same results but for one level higher, i.e., D1/A1, D2/A1, and
D1/A2.



D0/A0



Figures 5(b), (e), and (h) show the interaction between the two level 0 policies, and
Figures 4(a) and (d) show the average player performance. These initial simulations
set the stage for interpreting the subsequent reinforcement learning. For
q3,max < 0.8, the black circles in Figure 4(d) show that A0 is unable to push V2
outside of the range [1 - ε, 1 + ε]. The explanation is found in Figure 5(b). With
V2 < 1 and, say, q3,max = 0.7, A0's drift will have saturated at q3 = q3,max = 0.7.
However, with θA = 0.07, A0 will not strike by changing to q3 = -q3,max = -0.7
unless he projects that such a strike could drive V2 below 0.93. A0's limited
q3-strike capability is not enough to overcome the threshold, and the system becomes
locked in a quasi-steady state. In the midrange of A0's capability
(0.8 ≤ q3,max ≤ 1.4), the drift-and-strike A0 policy is effective (Figure 5(e)).
However, A0 is only successful for strikes that force V2 < 0.95. In addition, there
are periods of time when V2 ≈ 1.0 and A0 is unable to decide on a drift direction.
However, these become fewer (and A0's average reward grows) as q3,max approaches
1.4 (Figure 4(d)). For q3,max ≥ 1.6, A0 is able to successfully strike for V2 < 0.93
and V2 > 1.07, and A0 drives the system into a nearly periodic oscillation
(Figure 5(h)) with a correspondingly large increase in A0's discounted average
reward (Figure 4(d)). The reduction in D0's performance closely mirrors the increase
in A0's performance as q3,max increases. However, it is important to note that D0
enables much of A0's success by changing V1 to chase V2 and V3. The adjustments in
V1 made by D0 in Figures 5(b), (e), and (h) bring the system closer to the voltage
limits just as A0 gains a large strike capability.

† Singh et al. [36] describe the characteristics of SARSA when used in partially
observable situations. SARSA will converge to a reasonable policy as long as the
observed variables are reasonably Markov.



D1 Training Versus A0

The red triangles in Figure 4(a) and the black circles in Figure 4(e) show dramatic
improvement in the performance of D1 over D0 when faced with A0. In the middle
range of A0's capability (0.8 ≤ q3,max ≤ 1.4), Figure 5(d) shows that D1 stops
changing V1 to chase the immediate reward sought by D0. Instead, D1 maintains
a constant V1 = 1.02, keeping V2 ≈ 1.0 and A0 uncertain about which direction to
drift. By keeping V1 > 1.0, D1 also corrects the error of D0, whose lower values of
V1 helped A0 push V2 and V3 below 1 - ε. With V1 = 1.02, the average of V2 and V3
is significantly higher than 1.0, but D1 accepts the immediate decrement in average
reward to avoid a much bigger decrement he would suffer from an A0 strike. The
effect of this new strategy is also reflected in the poor A0 performance as seen from
the black circles in Figure 4(e). The behavior of D1 for q3,max ≥ 1.6 in Figure 5(g)
becomes complex. However, it appears that D1 has again limited the amount that he
chases V2 and V3. In fact, D1 moves V1 in a way that decreases his immediate
reward, but this strategy appears to anticipate A0's moves and effectively cuts off and
reverses A0 in the middle of his drift sequence. We note that this behavior of the
defender makes sense because he knows that the attacker is there waiting to strike.
In real life, a grid operator may not realize that a cyber attack is even taking place.
Capturing this phenomenon motivates follow-on work in uncertainty modeling of
the attacker's existence.



A1 Training Versus D0

A cursory inspection of Figures 5(c), (f), and (i) might lead one to believe that the
A1 training has resulted in A1 simply oscillating q3 back and forth from +q3,max to
-q3,max. However, the training has resulted in rather subtle behavior, which is most
easily seen in Figure 5(c). The largest change A1 (with q3,max = 0.7) can
independently make in V2 is ≈ 0.04. However, A1 gains an extra 0.02 of voltage change by
leveraging (or perhaps convincing) D0 to create oscillations of V1 in phase with his
own moves. For this strategy to be effective in pushing V2 below 1 - ε, the V1
oscillations have to take place between 1.0 and 1.02, or lower. When the synchronization
of the V1 and A1 oscillations is disturbed, such as at around step 50 in Figure 5(c),
A1 modifies his move in the short term to delay the move by D0 and re-establish the
synchronization. A1 also appears to have a strategy for "correcting" D0's behavior
if the oscillations take place between levels of V1 that are too high. Near step 40 in
Figure 5, A1 once again delays his move, convincing D0 to make two consecutive
downward moves of V1 to re-establish the "correct" D0 oscillation level. Similar
behavior is observed out to q3,max = 1.4. At q3,max = 1.6, A1 has enough capability
that he can leverage in-phase D0 oscillations to exceed both the V2 lower and upper
voltage limits. This improved performance is reflected in the dramatic increase in
A1's average reward (A1/D0; see red triangles in Figure 4(d)).
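The two voltage figures quoted in this paragraph can be checked with a short calculation. The sketch below assumes that the reactive flow on the first line segment responds one-for-one (with opposite sign) to q3, so that V2 moves by x1 Δq3 when V1 and the load are held fixed. This sensitivity is an assumption about the LinDistFlow sign convention, and the symbols ΔV2 and Δq3 are introduced here only for the illustration.

```latex
\begin{align*}
\Delta q_3 &= q_{3,\max} - (-q_{3,\max}) = 2 \times 0.7 = 1.4, \\
\Delta V_2 &= x_1 \, \Delta q_3 = 0.03 \times 1.4 = 0.042 \approx 0.04, \\
\Delta V_2 + \delta v &= 0.042 + 0.02 = 0.062 .
\end{align*}
```

Under these assumptions, a full swing of q3 moves V2 by about 0.04, and one in-phase transformer tap step by the defender adds the extra 0.02 mentioned above.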

D1/A1

In the hierarchy of level-K reinforcement learning, D1/A1 is similar to D0/A0 in that
they do not train against one another, but this matchup sets the stage for interpreting
the level-2 trainings. Figures 5(a), (d), and (g) show that the D1/A0 training results
in a D1 that does not chase V2 and V3, keeps V2 near 1.0, and accepts a lower current
reward to avoid large A0 strikes. In Figures 6(b), (e), and (h), D1 continues to avoid
responding to the oscillatory behavior of A1, and V2 generally does not cross beyond
the acceptable voltage limits. However, V3 is allowed to deviate significantly beyond
the bounds. The result is that D1's average reward versus A1 does not show much if any
improvement over D0's versus A1 (red triangles and black circles, respectively, in
Figure 4(b)). However, D1 is quite effective at reducing the performance of A1
(Figure 4(e), red triangles) relative to the performance of A1 in D0/A1, at least
for the intermediate values of q3,max (Figure 4(d), red triangles). The results for A1
are clearer. Figures 6(b), (e), and (h) show the oscillatory behavior of A1, while
Figures 4(a), (b), (d), and (e) show that the switch from A0 to A1 when facing D1
improves the attacker's performance while degrading the performance of D1.

D2 Training Versus A1

The results of this training start out similar to the training for D1. Figure 6(a) shows
that, at q3,max = 0.7, D2 performs better if he does not make many changes of V1,
thereby denying A1 the opportunity to leverage his moves to amplify the swings of
V2. For the higher values of q3,max in Figures 6(d) and (g), D2 learns to anticipate the
move pattern of A1 and moves in an oscillatory fashion, but one that is out of phase
with the moves of A1. Instead of amplifying the swings of V2, D2's moves attenuate
these swings. This new behavior results in across-the-board improvement in D2's
average discounted reward over D1 (blue squares versus red triangles in Figure 4(b))
and a significant reduction in A1 performance (red triangles in Figure 4(e) versus
Figure 4(f)).

A2 Training Versus D1

A2 shows no perceptible increase in performance over A1 when matched against
D1 (blue squares versus red triangles in Figure 4(e)). The same general observation
can be made for A2 and A1 when matched against any of D0, D1, or D2.
Figures 4(b) and (c) show that the defenders perform nearly the same against A1 or
A2, and Figures 4(e) and (f) show no significant change in attacker performance
when switching from A1 to A2. This may indicate that the policies embodied in A2 (or
A1) are approaching a fixed point in performance.



D2/A2 

The similarities in the performance of A1 and A2 make the analysis of this
interaction nearly the same as that of D2/A1.



5 Conclusions and Future Work 

In this paper, we introduced a strategic, computationally-tractable, experimentally- 
motivated model for predicting human behavior in novel and complex time-extended 
scenarios. This model consists of an iterated semi net-form game combined with a 
level-K RL solution concept. We applied this model to predict behavior on a cyber 
battle on a smart power grid. As discussed in the results section, the predictions of 
this model are promising in that they match expectations for how a "real world" 
cyber battle would unfold. 

We can vary parameters of the model that both concern the kind of cyber battle 
taking place (e.g., degree of compromise) and that describe the players (e.g., level 
distributions, their level K). We can also vary the control algorithm. We can then 
evaluate the expected "social welfare" (i.e., the happiness metric of the system de- 
signer) for all such variations. In this way our framework can be used to increase 
our understanding of existing and proposed control algorithms to evaluate their ro- 
bustness under different cyber attack scenarios and/or model mis-specification. In 
the near future, with additional advances in our computational algorithms, we hope 
to be able to solve the model in real-time as well. This raises the possibility of using 
our framework to do real-time control rather than choose among some small set of 
proposed control algorithms, i.e., to dynamically predict the attacker's policy and 
respond optimally as the cyber battle unfolds. 

Despite the significant modeling advances presented here, there are several
important ways in which the realism of this paper's model can be improved. Some of
these improvements have already been formalized, but they were left out of this
document for the purposes of space and clarity. For example, the iterated semi net-form
game framework easily models the situation where players have uncertainty about 
the environment they are facing. This includes uncertainty about the utility func- 
tions and the rationality (or levels) of the other players. This naturally corresponds 
to the Bayesian games setting within the extensive form games formalism. This also 
includes uncertainty about whether or not the other players exist. In fact, the semi 
net-form game formalism is unique in that it can even be extended to handle "un- 
awareness" - a situation where a player does not know of the possibility of some 
aspect of the game. For example, it would be unawareness, rather than uncertainty, 
if the defender did not know of the possibility that an attacker could take control of 
a portion of the smart power grid. These types of uncertainty and unawareness will 
be presented and explored in future work. 

Another important modeling advance under development is related to the abil- 
ity of players to adapt their policies as they interact with their opponents and make 
observations of their opponents' actual behavior. The level-K RL solution concept 
is particularly well-suited to relatively short-term interactions, like the cyber battle 
analyzed above. However, as interactions draw out over a longer time-frame, we 
would expect the players to incorporate their opponent's actual behavior into their 
level-K model of their opponent. One possibility for achieving this type of adapta- 
tion is based on a player using a Bayesian variant of fictitious play to set the level 
distribution of their opponent. In other words, we use the past behavior to update 
the level distribution of the opponent. 

This discussion raises an important question about what happens when the strate- 
gic situation is not novel and/or the players have previously interacted. Is the level- 
K RL model developed here still appropriate? The answer is probably no. In such 
an interacted environment, we should expect the players to have fairly accurate 
beliefs about each other. Furthermore, these accurate beliefs should lead to well- 
coordinated play. For example, in the power grid this would mean that the attacker 
and defender have beliefs that correspond to what the other is actually doing rather 
than corresponding to some independent model of the other's behavior. At the very
least, we should not expect the players to be systematically wrong about each other
as they are in the level-K model. Rather, in this interacted environment, player be- 
havior should be somewhere between the completely non-interacted level-K models 
and a full-on equilibrium, such as Nash equilibrium or quantal response equilibrium. 
The analysis of interacted, one-shot games found in Bono and Wolpert [1, 41] should
provide a good starting point for developing a model of an interacted, time-extended 
game. 

Perhaps the most important next step for this work is the process of estimating 
and validating our model using real data on human behavior. We specifically need 
data to estimate the parameters of the utility functions and the level K of the
players as well as any parameters of their level 0 strategies. After fitting our model to
data, we will validate our model against alternative models. The difficult part about
choosing alternative models with which to compare our model is that extensive-form
games and equilibrium concepts are computationally intractable in the types of
domains for which our model is designed. Therefore, feasible alternative models will
likely be limited to simplified versions of the corresponding extensive-form game 
and agent-based simulations of our iterated semi net-form game. 

For the smart grid cyber battle analyzed in this paper, there are several options 
for gathering data. One is to conduct conventional game-theoretic experiments with 
human subjects in a laboratory setting. Unfortunately, estimating our model, espe- 
cially with the modeling advances discussed above, will require more data than is 
practical to collect via such conventional experimental methods, which involve ac- 
tual power grid operators in realistic settings. An alternative method for collecting 
the large amount of data required is via "crowd-sourcing". In other words, it should 
be possible to deploy an internet-application version of our smart grid cyber battle 
to be played by a mixture of undergraduates, researchers, and power engineers. The 
data from these experiments would then be used to estimate and validate our model. 

The methodologies presented here, and the proposed future extensions, also ap- 
ply to many other scenarios. Among these are several projects related to cyber secu- 
rity as well as the Federal Aviation Administration's NextGen plan for modernizing 
the National Airspace System. To encompass this range of applications, we are de- 
veloping libNFG as a code base for implementing and exploring NFGs [24]. The 
development of this library is ongoing, and modeling advances, like those men- 
tioned above, will be implemented as they become an accepted part of the modeling 
framework. The libNFG library will ultimately be shared publicly and will enable 
users to fully customize their own iterated semi net-form game model and choose 
from a range of available solution concepts and computational approaches. 
Fig. 4 Average reward per step averaged over 50 episodes as a function of q_{3,max} for all pairings 
of the defender (D) and attacker (A) through level 2. (a) Reward of D0, D1, and D2 when matched 
against A0. (b) Same as (a) but for A1. (c) Same as (a) and (b) but for A2. (d) Reward of A0, A1, 
and A2 when matched against D0. (e) Same as (d) but for D1. (f) Same as (d) and (e) but for D2. 
In general, we observe that as q_{3,max} increases, the defender's average reward decreases and the 
attacker's average reward increases. 
Fig. 5 Simulations of system voltages for level 0 and level 1 that show the evolution in level 
1 attacker (A1) and level 1 defender (D1) policies after a reinforcement learning training session 
against their level 0 counterparts D0 and A0. (a) D1 versus A0, (b) D0 versus A0, and (c) D0 versus 
A1 for q_{3,max} = 0.7. (d) D1 versus A0, (e) D0 versus A0, and (f) D0 versus A1 for q_{3,max} = 1.2. (g) 
D1 versus A0, (h) D0 versus A0, and (i) D0 versus A1 for q_{3,max} = 1.6. In the center column (D0 
versus A0), the attacker becomes increasingly capable of scoring against the defender as q_{3,max} is 
increased. In the left column (D1 versus A0), the defender is successful at avoiding attacks by not 
chasing small immediate rewards from voltage centering. In the right column (D0 versus A1), the 
attacker successfully leverages the level 0 defender's move to help him score. 
Fig. 6 Simulations of system voltages for level 1 and level 2 that show the evolution in level 
2 attacker (A2) and level 2 defender (D2) policies after a reinforcement learning training session 
against their level 1 counterparts D1 and A1. (a) D2 versus A1, (b) D1 versus A1, and (c) D1 versus 
A2 for q_{3,max} = 0.7. (d) D2 versus A1, (e) D1 versus A1, and (f) D1 versus A2 for q_{3,max} = 1.2. (g) 
D2 versus A1, (h) D1 versus A1, and (i) D1 versus A2 for q_{3,max} = 1.6. 



